A stochastic model incorporates randomness; given the same inputs, a stochastic model may generate different outputs on different runs.
This is useful when the process you want to model is fundamentally stochastic, or if there is heterogeneity in its dynamics.
Unfortunately, stochastic models can get complicated, and they come with a lot of notation and formalism. Our goal here is to quickly review basic probability theory so that we can use stochastic models productively.
A random variable is a variable whose value is random. In this course, random variables will usually have some real-world meaning, like:
Probability \(\Pr(\cdot)\) is a function whose argument is a value a random variable can take on, and which returns a number between 0 and 1 (inclusive). The set of all such values, together with their probabilities, describes the probability distribution.
For discrete random variables, we will write the probability mass function \(\Pr(X=x)\). For continuous random variables, we will use the density, defined below.
We usually describe random variables in terms of their probability distribution. For example, consider the random variable \(X\) defined by \(\Pr(X=0) = 1/2\) and \(\Pr(X=1)=1/2\). What might \(X\) refer to?
All random variables on the real line have a cumulative distribution function (CDF) \[ F(x) = \Pr(X \le x) \] The cumulative distribution function fully characterizes the probability distribution of \(X\).
The density of a continuous random variable is \(f(x)\), the derivative of the CDF \(F(x)\) of \(X\), where it exists: \[ f(x) = \lim_{h\to 0} \frac{F(x+h) - F(x)}{h} \]
\(\Pr(Y=y|X=x)\) is the probability that \(Y=y\), given that \(X=x\). We treat the statement on the right-hand side of the conditioning sign \(|\) as known. The definition of conditional probability tells us how to compute it:
\[ \Pr(Y=y|X=x) = \frac{\Pr(Y=y,X=x)}{\Pr(X=x)} \]
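As a quick sketch (in Python rather than R, with a made-up joint distribution), conditional probabilities can be computed directly from this definition:

```python
# Hypothetical joint distribution Pr(X=x, Y=y) for two binary variables.
joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.3,
}

def pr_x(x):
    """Marginal Pr(X=x), summing the joint over y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def pr_y_given_x(y, x):
    """Pr(Y=y | X=x) = Pr(Y=y, X=x) / Pr(X=x)."""
    return joint[(x, y)] / pr_x(x)

print(pr_y_given_x(1, 0))  # 0.1 / 0.5 = 0.2
print(pr_y_given_x(1, 1))  # 0.3 / 0.5 = 0.6
```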
The expected value of a random variable is its “average” value.
\[ E[X] = \sum_x x \Pr(X=x) \]
or
\[ E[X] = \int_{-\infty}^\infty x f(x) dx \]
where \(f(x)\) is the density of \(X\). The expectation is the average of the values \(x\) that \(X\) can take on, weighted by their probabilities.
Can a random variable \(X\) have expectation equal to a value that \(X\) can never take? Yes. If \(X\sim \text{Bernoulli}(1/2)\), then \(\Pr(X=x)=1/2\) for \(x=1\) and \(x=0\). But \(E[X]=1/2\).
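A short simulation makes this concrete (a Python sketch): the sample mean of Bernoulli\((1/2)\) draws approaches \(1/2\), even though no individual draw ever equals \(1/2\).

```python
import random

random.seed(1)
# X ~ Bernoulli(1/2): draws are always 0 or 1, never 1/2.
draws = [random.randint(0, 1) for _ in range(100_000)]
print(sum(draws) / len(draws))  # close to E[X] = 0.5
```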
Conditional expectations are just expectations with respect to a conditional probability distribution. \[ E[Y|X=x] = \sum_y y\Pr(Y=y|X=x) \] \[ E[Y|X=x] = \int_{-\infty}^\infty y f(y|X=x) dy \]
Law of total probability
\[ \Pr(X=x) = \sum_y \Pr(X=x|Y=y) \Pr(Y=y) \] \[ \Pr(X=x) = \int \Pr(X=x|Y=y) f(y) dy \]
Law of total expectation
\[ E[X] = \sum_y E[X|Y=y] \Pr(Y=y) \] \[ E[X] = \int_{-\infty}^\infty E[X|Y=y] f(y) dy \]
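Both laws are easy to check on a small discrete example (a Python sketch with hypothetical numbers):

```python
# Hypothetical conditional distributions for a binary pair (X, Y).
pr_y = {0: 0.3, 1: 0.7}                  # Pr(Y=y)
pr_x_given_y = {0: {0: 0.9, 1: 0.1},     # Pr(X=x | Y=0)
                1: {0: 0.2, 1: 0.8}}     # Pr(X=x | Y=1)

# Law of total probability: Pr(X=1) = sum_y Pr(X=1|Y=y) Pr(Y=y)
pr_x1 = sum(pr_x_given_y[y][1] * pr_y[y] for y in pr_y)
print(pr_x1)  # 0.1*0.3 + 0.8*0.7 = 0.59

# Law of total expectation: E[X] = sum_y E[X|Y=y] Pr(Y=y)
e_x = sum(sum(x * p for x, p in pr_x_given_y[y].items()) * pr_y[y]
          for y in pr_y)
print(e_x)    # equals Pr(X=1) here, because X only takes values 0 and 1
```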
Two random variables are independent if their joint probability factorizes into the product of their marginal probabilities, \[ \Pr(X=x,Y=y) = \Pr(X=x) \Pr(Y=y) \] or \(f(x,y) = f(x)f(y)\). e.g. independent coin flips, die rolls, etc.
This is the classic coin-flipping distribution.
The random variable \(X\) has Bernoulli\((p)\) distribution if \(\Pr(X=1)=p\) and \(\Pr(X=0)=1-p\).
A Binomial\((n,p)\) random variable is a sum of \(n\) Bernoulli variables, each independent with probability \(p\). The probability mass function is \[ \Pr(X=k) = \binom{n}{k} p^k (1-p)^{n-k} \] where \(0 \le k \le n\).
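This mass function is easy to verify numerically; a Python sketch (parameter values arbitrary):

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
print(sum(pmf))                               # probabilities sum to 1
print(sum(k * q for k, q in enumerate(pmf)))  # mean is n*p = 3.0
```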
A Geometric\((p)\) random variable is the number of Bernoulli\((p)\) trials needed to achieve the first success. The probability mass function is
\[ \Pr(X=k) = (1-p)^{k-1} p \] where \(k \ge 1\).
A common probability model for count data is the Poisson distribution, \[ \Pr(X = k) = \frac{(\lambda t)^k e^{-\lambda t}}{k!} \] for \(k = 0, 1, 2, \ldots\). This distribution looks weird, but it has a nice story, which we will see when we talk about Poisson processes.
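As a sanity check (a Python sketch, with an arbitrary rate), the probabilities sum to 1 over the support \(k \ge 0\) and the mean is \(\lambda t\):

```python
from math import exp, factorial

def poisson_pmf(k, rate):
    """Pr(X = k) for X ~ Poisson(rate), where rate = lambda * t."""
    return rate**k * exp(-rate) / factorial(k)

rate = 2.5  # lambda * t, chosen arbitrarily
pmf = [poisson_pmf(k, rate) for k in range(100)]  # truncate the infinite sum
print(sum(pmf))                               # ~1
print(sum(k * q for k, q in enumerate(pmf)))  # mean is lambda * t = 2.5
```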
The uniform distribution is usually defined on \([0,1]\). \(X \sim \text{Unif}(0,1)\) means the density is \[ f(x) = 1 \] for \(0\le x \le 1\).
If \(X\) has beta distribution, the density is
\[ f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)} \]
for \(0\le x \le 1\), where
\[ B(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} \]
If \(X\) has exponential distribution, the density is
\[ f(x) = \lambda e^{-\lambda x} \]
\[ F(x) = 1-e^{-\lambda x} \] for \(x \ge 0\).
Memoryless property: \[ \Pr(X>t+h| X>t) = \frac{\Pr(X>t+h)}{\Pr(X>t)} = \frac{e^{-\lambda(t+h)}}{e^{-\lambda t}} = e^{-\lambda h} = \Pr(X>h) \]
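This identity can be checked numerically; a Python sketch with arbitrary values of \(\lambda\), \(t\), and \(h\):

```python
from math import exp, isclose

lam, t, h = 0.7, 2.0, 1.5  # arbitrary rate and times

def surv(s):
    """Survival function of Exponential(lam): Pr(X > s) = exp(-lam * s)."""
    return exp(-lam * s)

lhs = surv(t + h) / surv(t)  # Pr(X > t+h | X > t)
rhs = surv(h)                # Pr(X > h)
print(isclose(lhs, rhs))     # True: the memoryless property
```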
The normal\((\mu,\sigma^2)\) density is
\[ f(x) = \frac{1}{\sqrt{2\pi} \sigma} \exp\left[ -\frac{1}{2\sigma^2} (x-\mu)^2 \right] \]
Where does this functional form come from? It’s complicated.
There are lots of other “canonical” probability distributions
Discrete: Negative binomial, beta binomial, counting distributions
Continuous: Gamma, Lognormal, Laplace, survival/accelerated failure time distributions
We won’t use these in this class, but it’s important to know that there are lots of options for modeling random quantities. This is partly what the field of statistics is about.
A decision tree represents a sequence of decisions, actions, or states. You can think of an individual jumping from left to right along the nodes in the tree. At each node, they follow a link with the given probabilities. This means their path through the tree is stochastic.
Simple decision tree for the treatment of individuals diagnosed with a certain disease. Following diagnosis, each individual gets treatment A or B. Then either they survive or die by some follow-up time, depending on which treatment they got.
Computing “marginal” and conditional probabilities is easy via law of total probability: \[ \Pr(\text{Alive}|\text{Treatment A}) = 0.7 \] \[ \Pr(\text{Alive}|\text{Treatment B}) = 0.1 \] \[ \Pr(\text{Alive}) = 0.4 \times 0.7 + 0.6 \times 0.1 = 0.34 \]
Now consider a different tree, in which 90% of individuals get Treatment A (survival probability 0.6) and 10% get Treatment B (survival probability 0.9). The total probability of being alive is \[ \Pr(\text{Alive}) = 0.9 \times 0.6 + 0.1 \times 0.9 = 0.63\]
Let’s look at the alive people and see which treatment they got: \[ \Pr(\text{Treatment A}|\text{Alive}) = 0.9 \times 0.6 / 0.63 = 0.86 \] \[ \Pr(\text{Treatment B}|\text{Alive}) = 0.1 \times 0.9 / 0.63 = 0.14 \]
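These tree calculations are easy to reproduce (a Python sketch):

```python
# Decision tree: 90% receive Treatment A (survival 0.6),
# 10% receive Treatment B (survival 0.9).
pr_a, pr_b = 0.9, 0.1
pr_alive_a, pr_alive_b = 0.6, 0.9

# Law of total probability
pr_alive = pr_a * pr_alive_a + pr_b * pr_alive_b
print(pr_alive)  # 0.63

# Bayes' theorem: which treatment did the survivors get?
pr_a_given_alive = pr_a * pr_alive_a / pr_alive
print(round(pr_a_given_alive, 2))  # 0.86
```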
Most of the alive people got Treatment A. Treatment A must be best! Is that true?
In this course, we will focus on Markov stochastic models in discrete and continuous time. A Markov model is one whose next event depends only on its current state, and not on its prior history (Bailey and Bailey 1990; Renshaw 2015).
Let \(X(t)\) be a Markov process taking \(n\) different states, in discrete time, \(t=1,2,\ldots\). Define the matrix of one-step transition probabilities from each state \(i\) to each state \(j\),
\[ \mathbf{P} = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nn} \end{pmatrix} \]
Define the transition probability of moving from state \(i\) to state \(j\) in time \(t\) as
\[ P_{ij}(t) = \Pr(X(t) = j | X(0) = i) \]
You can compute transition probabilities by multiplying \(\mathbf{P}\) by itself \(t\) times:
\[ P_{ij}(t) = [\mathbf{P}^t]_{ij} \]
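A minimal sketch of this computation, using a hypothetical two-state chain and plain Python (in R this would just be repeated `%*%` multiplication):

```python
# Hypothetical one-step transition matrix for a two-state chain.
P = [[0.9, 0.1],
     [0.5, 0.5]]

def matmul(A, B):
    """Multiply two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matpow(P, t):
    """P multiplied by itself t times (t >= 1)."""
    out = P
    for _ in range(t - 1):
        out = matmul(out, P)
    return out

P10 = matpow(P, 10)
print(P10[0][1])    # Pr(X(10) = 2 | X(0) = 1)
print(sum(P10[0]))  # each row of P^t still sums to 1
```

After 10 steps this chain is already close to its stationary distribution, which for this matrix is \((5/6, 1/6)\).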
We described deterministic systems in terms of their rates earlier. But a continuous-time stochastic model produces different output each time we run it. How can we describe its rates of change when its path is different every time?
We need to describe the rate of change of its transition probabilities. Consider a continuous-time Markov chain \(X(t)\) on a discrete state space.
Let the transition rate from state \(i\) to state \(j\) be \(\lambda_{ij}\), and define the rate matrix
\[ \mathbf{Q} = \begin{pmatrix}
-\lambda_{11} & \lambda_{12} & \cdots & \lambda_{1n} \\
\vdots & \vdots & \ddots & \vdots \\
\lambda_{n1} & \lambda_{n2} & \cdots & -\lambda_{nn}
\end{pmatrix}
\]
where \(\lambda_{ii} = \sum_{j\neq i} \lambda_{ij}\). The transition probabilities \(P_{ij}(t) = \Pr(X(t)=j|X(0)=i)\) can be computed by:
\[ \mathbf{P}(t) = \exp[\mathbf{Q} t] \]
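One simple (though not numerically robust) way to compute this matrix exponential is a truncated Taylor series. The sketch below, in Python with a hypothetical two-state chain, checks the result against the known closed form \(P_{11}(t) = \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu} e^{-(\lambda+\mu)t}\):

```python
from math import exp

# Hypothetical two-state chain: Q = [[-lam, lam], [mu, -mu]].
lam, mu, t = 1.0, 2.0, 1.0
Q = [[-lam, lam], [mu, -mu]]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(Q, t, terms=60):
    """exp(Qt) = sum_k (Qt)^k / k!, truncated after `terms` terms."""
    n = len(Q)
    Qt = [[q * t for q in row] for row in Q]
    P = [[float(i == j) for j in range(n)] for i in range(n)]     # identity
    term = [[float(i == j) for j in range(n)] for i in range(n)]  # (Qt)^0/0!
    for k in range(1, terms):
        term = matmul(term, Qt)                      # now (Qt)^k / (k-1)!
        term = [[x / k for x in row] for row in term]  # now (Qt)^k / k!
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return P

P = expm(Q, t)
s = lam + mu
closed = mu / s + lam / s * exp(-s * t)  # closed-form P_11(t)
print(P[0][0], closed)                   # the two agree
```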
Consider a continuous-time Markov model at time \(t\), where \(X(t)=i\). The rate of jumping to state \(j\) is \(\lambda_{ij}\). What does this mean?
It means that the jump to \(j\) occurs after a waiting time with distribution Exponential\((\lambda_{ij})\). If there are many states \(j\) with \(\lambda_{ij}>0\), then the waiting time to the next jump to any state is the minimum of all these exponential waiting times. Fortunately,
\[ \min\{W_{ij}:\ j\neq i\} \sim \text{Exponential}\left(\sum_{j\neq i} \lambda_{ij}\right) \]
Furthermore, the destination of the next jump is independent of this waiting time.
Several properties of CTMCs are useful:
Drawbacks:
Now consider a continuous-time Markov chain whose only positive transition rates are to the next integer state, \(\lambda_{i,i+1}=\lambda\). This is called a counting process. Let \(X(t)=i\), so the waiting time to the jump to \(i+1\) is
\[ W_i \sim \text{Exponential}(\lambda) \]
following the last event. Given \(X(0)=0\), what is the distribution of the number of jumps \(X(t)\)?
\[ X(t) \sim \text{Poisson}(\lambda t) \]
So \(E[X(t)] = \lambda t\). This is where the Poisson distribution comes from.
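This connection can be checked by simulation; a Python sketch (rate and horizon arbitrary) that accumulates exponential waiting times and counts jumps:

```python
import random

random.seed(42)
lam, t = 2.0, 5.0  # arbitrary rate and time horizon

def count_jumps(lam, t):
    """Number of jumps of a rate-lam counting process by time t."""
    time, n = 0.0, 0
    while True:
        time += random.expovariate(lam)  # Exponential(lam) waiting time
        if time > t:
            return n
        n += 1

runs = 20_000
mean = sum(count_jumps(lam, t) for _ in range(runs)) / runs
print(mean)  # close to lam * t = 10, the Poisson(lam * t) mean
```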
A queue is a model for a list of individuals waiting to be served. Examples:
The simplest queue is a continuous-time Markov model with \(\lambda_{i,i+1}=\lambda\) and \(\lambda_{i,i-1}=\mu\) (the M/M/1 queue, with a single server).
A slightly more complicated queue has \(\lambda_{i,i+1}=\lambda\) and \(\lambda_{i,i-1}=i\mu\) (the M/M/\(\infty\) queue, in which every individual in the queue is served simultaneously).
Queues are useful because they can be combined into a stochastic model for sequential movement through a series of compartments or states.
Gonsalves et al. (2017) use this approach to model flow through the HIV care continuum in the United States.
The “Gillespie algorithm” performs forward simulation of a stochastic compartmental model (Pineda-Krch (2008)). You don’t really need to know how it works. Forward simulation of continuous-time Markov models is computationally straightforward.
The ssa function in the GillespieSSA package implements a fast version of the Gillespie algorithm.
\(X(t)\) counts the number of individuals in a growing population. The transition rates are \(\lambda_{i,i+1}=i\lambda\).
In the call to ssa, x0 is the starting state, a is the transition rate, nu is the change in the compartment, parms is the parameters, and tf is the final time.
plot(out$data[, 1], out$data[, 2], col = "red",
    pch = 19, xlab = "Time", ylab = "Number",
    bty = "n")
The function returns a data frame with jump times and states.
## timeSeries X
##   0.000000 1
##   1.679942 2
##   2.265076 3
##   2.360926 4
##   2.369877 5
##   2.618956 6
parms <- c(b = 2, d = 1, K = 1000)
x0 <- c(N = 500)
a <- c("b*N", "(d+(b-d)*N/K)*N")
nu <- matrix(c(+1, -1), ncol = 2)
out <- ssa(x0, a, nu, parms, tf = 10, simName = "Logistic growth")
plot(out$data[, 1], out$data[, 2], type = "l",
xlab = "Time", ylab = "Number", bty = "n")
abline(h = 1000, lty = "dashed", col = "gray")
plot(out$data[, 1], out$data[, 2], col = "green",
ylim = c(0, 500), type = "l", xlab = "Time",
ylab = "Number", bty = "n")
lines(out$data[, 1], out$data[, 3], col = "red")
lines(out$data[, 1], out$data[, 4], col = "blue")
legend(50, 500, c("S", "I", "R"), pch = 16,
col = c("green", "red", "blue"))
“Statistical models” are stochastic because they contain random variables. Consider the usual Normal/Gaussian linear model \[ y = \alpha + \beta x + \epsilon \] where \(\epsilon \sim N(0,\sigma^2)\). This model is stochastic because \(\epsilon\) is random.
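The stochasticity is easy to see by simulation; a minimal Python sketch (all parameter values hypothetical) that draws the same model twice and gets different outputs:

```python
import random

random.seed(0)
alpha, beta, sigma = 1.0, 2.0, 0.5  # hypothetical parameters
xs = [i / 10 for i in range(10)]

def simulate(xs):
    """One draw of y = alpha + beta*x + eps, with eps ~ N(0, sigma^2)."""
    return [alpha + beta * x + random.gauss(0, sigma) for x in xs]

print(simulate(xs))
print(simulate(xs))  # different output: the model is stochastic
```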
In my opinion, you should think about so-called statistical models as mechanistic.
Bailey, Norman T, and Norman TJ Bailey. 1990. The Elements of Stochastic Processes with Applications to the Natural Sciences. Vol. 25. John Wiley & Sons.
Gonsalves, Gregg S, A David Paltiel, Paul D Cleary, Michael J Gill, Mari M Kitahata, Peter F Rebeiro, Michael J Silverberg, et al. 2017. “A Flow-Based Model of the HIV Care Continuum in the United States.” Journal of Acquired Immune Deficiency Syndromes (1999) 75 (5). NIH Public Access: 548–53.
Pineda-Krch, Mario. 2008. “GillespieSSA: Implementing the Stochastic Simulation Algorithm in R.” Journal of Statistical Software 25 (12): 1–18.
Renshaw, Eric. 2015. Stochastic Population Processes: Analysis, Approximations, Simulations. Oxford University Press.